ABSTRACT

A sample of the Myspace artist network is examined to investigate
the relationship between social connectivity and audio-based
similarity. Audio data from the Myspace artist pages is analyzed
using well-established signal-based music information retrieval
techniques. In addition to showing that the Myspace artist network
exhibits many of the properties common to social networks, we show
there is an ambiguous relationship between audio-based similarity
and the social network topology.


1. INTRODUCTION 

Myspace 1 has become the de-facto standard for web-based
music artist promotion. Although exact figures are not
made public, third-party estimates suggest there are around
7 million artist pages 2 on Myspace.

Artists ranging from amateur to the most commercially
successful publish Myspace pages. These Myspace artist
pages typically include some media - usually streaming
audio, video, or both - and a list of “friends” specifying
social connections. This combination of media and a
user-specified social network provides a unique data set that is
unprecedented in both scope and scale. By examining a
sample of the Myspace artist network using complex network
theory and analyzing the corresponding media using
audio-based music information retrieval techniques,
we examine the relationship between artists’ social
relationships and the audio-based similarity of their respective
musical works.

We begin by reviewing relevant literature from complex
network theory and signal-based music analysis in
Section 2. The methods used to sample the Myspace artist
network are discussed in Section 3. The analysis methods
are discussed in Section 4. The implications and suggestions
for future work are discussed in Section 5.


1 http://myspace.com 

2 http://scottelkin.com/archive/2007/05/11/MySpace-Statistics.aspx
reports ~25 million songs as of April 2007; our estimates approximate
3.5 songs/artist, giving ~7 million artists.
2. RELATED WORK

This work uses a combination of complex network theory 
and signal-based music analysis. Both disciplines apply 
intuitively to music information retrieval; however, to the 
best knowledge of the authors, the two have never been 
applied simultaneously to a single data set. 

2.1. Complex Networks 

Complex network theory deals with the structure of
relationships in complex systems. Using the tools of graph
theory and statistical mechanics, physicists have developed
models and metrics for describing a diverse set of
real-world networks - including social networks, academic
citation networks, biological protein networks, and the
World-Wide Web. All these networks exhibit several unifying
characteristics such as small worldness, scale-free degree
distributions, and community structure [5, 12]. Let us
briefly discuss some definitions and concepts that will be
used in this work.

2.1.1. Network Properties

A given network $G$ is described by a set of nodes $N$ connected
by a set of edges $E$. Each edge is defined by the
pair of nodes it connects, $(i, j)$. If the edges imply
directionality, $(i, j) \neq (j, i)$, the network is a directed network.
Otherwise, it is an undirected network. The number of
edges incident to a node $i$ is the degree $k_i$. In a directed
network there will be an indegree $k_i^{in}$ and an outdegree
$k_i^{out}$ corresponding to the number of edges pointing into
the node and away from the node respectively.

Degree distribution: The degree distribution $P(k)$ is
the proportion of nodes that have a degree $k$. The shape
of the degree distribution is an important metric for
classifying a network - scale-free networks have a power-law
distribution [12] while random networks have a Poisson
distribution. The scale-free degree distribution is a
property common to many real-world networks. Conceptually,
a scale-free distribution indicates the presence of a few
very popular hubs that tend to attract more links as the
network evolves [5, 12].
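As an illustration of the definition above, the following minimal sketch computes $P(k)$ from an undirected edge list. The function name `degree_distribution` and the toy edge list are assumptions made for this example; they are not part of the tooling used in this work.

```python
from collections import Counter


def degree_distribution(edges):
    """Return {k: P(k)} for an undirected edge list of (i, j) pairs.

    Nodes that appear in no edge (degree 0) are not represented,
    because an edge list alone cannot reveal them.
    """
    degree = Counter()
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    n_nodes = len(degree)
    # Count how many nodes share each degree value k.
    nodes_per_degree = Counter(degree.values())
    return {k: count / n_nodes for k, count in sorted(nodes_per_degree.items())}


# Toy network (hypothetical): node "a" acts as a small hub.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
dist = degree_distribution(edges)
```

In this toy network one node has degree 1, two have degree 2, and one has degree 3, so the returned proportions sum to one.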



Average shortest path: Two nodes $i$ and $j$ are connected
if a path exists between them following the edges in
the network. The path from $i$ to $j$ may not be unique. The
geodesic path $d_{ij}$ is the shortest path distance from $i$ to $j$
in number of edges traversed. For the entire network, the
average shortest path or mean geodesic distance is $l$. In a
small-world network the mean geodesic distance is small
relative to the number of nodes in the network [5, 12]. The
largest geodesic distance in a network is known as the
diameter.
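The two quantities just defined can be computed with a breadth-first search from every node. The sketch below is one straightforward way to do so on a small undirected graph; the helper names and the toy path graph are assumptions for illustration, and the average is taken over reachable pairs only.

```python
from collections import deque


def bfs_distances(adj, source):
    """Geodesic distance (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def geodesic_stats(edges):
    """Return (mean geodesic distance, diameter) of an undirected graph."""
    adj = {}
    for i, j in edges:
        adj.setdefault(i, set()).add(j)
        adj.setdefault(j, set()).add(i)
    total, pairs, diameter = 0, 0, 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                # Each unordered pair is visited twice; the mean is unaffected.
                total += d
                pairs += 1
                diameter = max(diameter, d)
    return total / pairs, diameter


# Toy path graph (hypothetical): a - b - c - d
mean_l, d_max = geodesic_stats([("a", "b"), ("b", "c"), ("c", "d")])
```

Running one search per node costs time proportional to the number of nodes times the number of edges, which is workable for a sample of the size considered here but not for the full network.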

2.1.2. Networks and Music 

Quite naturally, networks of musicians have been studied
in the context of complex network theory - typically viewing
the artists as nodes in the network and using either
collaboration, influence, or similarity to define network
edges. These networks of musicians exhibit many of the
properties expected in social networks [4, 6, 14].

2.2. Signal-based Music Analysis 

A variety of methods have been developed for signal-based
music analysis, characterizing a music signal by its timbre,
harmony, rhythm, or structure. One of the most widely
used methods is the application of Mel-frequency cepstral
coefficients (MFCCs) to the modeling of timbre [10]. In
combination with various statistical techniques, MFCCs
have been successfully applied to music similarity and
genre classification tasks [3, 11, 13]. A common approach
for computing timbre-based similarity between two songs
or collections of songs is to build Gaussian mixture models
(GMMs) describing the MFCCs and to compare the GMMs
using a statistical distance measure. Often the earth mover’s
distance (EMD) [15], a technique first used in computer
vision, is the distance measure used for this purpose. The
EMD algorithm finds the minimum work required to transform
one distribution into another.
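To make the notion of "minimum work" concrete, the sketch below computes the EMD in a deliberately simplified setting: two one-dimensional histograms over the same unit-spaced bins with equal total mass. In that special case the minimum work equals the sum of absolute differences between the two cumulative sums. This is not the GMM-to-GMM EMD described above, which requires solving a transportation problem between mixture components; the function name `emd_1d` is an assumption for this example.

```python
from itertools import accumulate


def emd_1d(p, q):
    """EMD between two 1-D histograms on the same unit-spaced bins.

    Assumes equal total mass; the work is mass moved times bins travelled.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    if abs(sum(p) - sum(q)) > 1e-9:
        raise ValueError("histograms must have equal total mass")
    # The running surplus after each bin is the mass that must cross
    # the boundary to the next bin, in one direction or the other.
    surplus = accumulate(a - b for a, b in zip(p, q))
    return sum(abs(s) for s in surplus)


far = emd_1d([1, 0, 0], [0, 0, 1])              # all mass moves two bins
near = emd_1d([0.5, 0.5, 0.0], [0.0, 0.5, 0.5])
same = emd_1d([0.2, 0.8], [0.2, 0.8])           # identical histograms
```

Moving a unit of mass across two bins costs twice as much as moving it across one, which is the sense in which the EMD respects the geometry of the underlying feature space.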

2.3. Bringing It Together 

Recently some work has been conducted exploring the interplay
between user/artist generated metadata and content-based
similarity/retrieval. Most of this work [7, 16]
focuses on various ways of exploiting the human-generated
metadata to filter content prior to, or instead of, conducting
content-based analysis similar to the techniques
discussed in Section 2.2, in order to reduce computational load.

3. SAMPLING MYSPACE 

The Myspace social network presents a variety of challenges.
For one, the massive size prohibits analyzing the
graph in its entirety, even when considering only the artist
pages. Therefore we sample a small yet sufficiently large
portion of the network. Also, the Myspace social network
is filled with noisy data - plagued by spammers and orphaned
accounts. We limit the scope of our sampling in a
way that minimizes this noise. And finally, there currently
is no interface for easily collecting the network data from
Myspace. Our data is collected using web crawling and
HTML scraping techniques 3.

3.1. Artist Pages 

It is important to note we are only concerned with a subset
of the Myspace social network - the Myspace artist
network. Myspace artist pages are different from standard
Myspace pages in that they include a distinct audio player
application. We use the presence or absence of this player
to determine whether or not a given page is an artist page.

A Myspace page will most often include a top friends
list. This is a hyperlinked list of other Myspace accounts
explicitly specified by the user. The top friends list is
limited to a maximum of 40 friends (the
default length is 16 friends). In constructing our sampled
artist network, we use the top friends list to create
a set of directed edges between artists. Only top friends
who also have artist pages are added to the sampled network;
standard Myspace pages are ignored. We also ignore
the remainder of the friends list (i.e. friends that
are not specified by the user as top friends), assuming
these relationships are not as relevant. This reduces the
amount of noise in the sampled network but also artificially
limits the outdegree of each node. Our sampling
method is based on the assumption that artists specified as
top friends have some meaningful musical connection for
the user - whether through collaboration, stylistic similarity,
friendship, or artistic influence.

The audio files associated with each artist page in the
sampled network are also collected for feature extraction.
Cached versions of the audio files are downloaded and audio
features are extracted.

3.2. Snowball Sampling 

There are several network sampling methods; however,
for the Myspace artist network, snowball sampling is the
most appropriate method [1, 9]. In this method, the sample
begins with a seed node (artist page), then the seed
node’s neighbors (top friends), then the neighbors’ neighbors,
are added to the sample. This breadth-first sampling
is continued until a particular sampling ratio is achieved.
Here, we randomly select a seed artist 4 and collect all
artist nodes within 6 edges, yielding 15,478 nodes. If
the size of the Myspace artist network is around 7 million,
then this is close to the 0.25% sampling ratio suggested
for accurate degree distribution estimation in sampled
networks. However, it is insufficient for estimating
other topological metrics such as the clustering coefficient
and assortativity [8]. Of course, a complete network topology
is not our primary concern here.
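The breadth-first procedure described above can be sketched as follows. This is an illustrative outline, not the crawler used for this work: `get_top_friends` is a hypothetical callback standing in for the page-scraping step, the toy graph is invented, and the choice not to expand nodes sitting at the depth limit is an assumption of this sketch.

```python
from collections import deque


def snowball_sample(seed, get_top_friends, max_depth=6):
    """Breadth-first snowball sample of all nodes within max_depth edges.

    get_top_friends(artist) is a hypothetical callback returning the
    artist pages listed as that artist's top friends.
    """
    depth = {seed: 0}
    edges = []  # directed (artist, top friend) pairs
    queue = deque([seed])
    while queue:
        artist = queue.popleft()
        if depth[artist] == max_depth:
            continue  # nodes at the depth limit are kept but not expanded
        for friend in get_top_friends(artist):
            edges.append((artist, friend))
            if friend not in depth:
                depth[friend] = depth[artist] + 1
                queue.append(friend)
    return set(depth), edges


# Toy top-friends lists (hypothetical), sampled to a depth of 2.
toy = {"seed": ["a", "b"], "a": ["c"], "b": ["a"], "c": ["d"], "d": []}
nodes, edges = snowball_sample("seed", lambda a: toy.get(a, []), max_depth=2)

# Sampling ratio implied by the figures quoted in the text.
ratio = 15478 / 7_000_000
```

With the figures quoted above, the ratio evaluates to roughly 0.22%, slightly under the 0.25% guideline.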

3 Myspace scraping is done using tools from the MyPySpace project
available at http://mypyspace.sourceforge.net

4 The artist is Kama Zoo, Myspace url:
http://www.myspace.com/index.cfm?fuseaction=user.viewProfile&friendID=134901208
             $n$     $m$      $\langle k \rangle$   $l$     $d_{max}$
undirected   15478   91326    11.801                4.479   9
directed     15478   120487   15.569                6.426   16

Table 1. The network statistics for the Myspace artist network
sample, where $n$ is the number of nodes, $m$ is the
number of edges, $\langle k \rangle$ is the average degree, $l$ is the mean
geodesic distance, and $d_{max}$ is the diameter, as defined in
Section 2.1.1.

With snowball sampling there is a tendency to oversample
hubs because they have many links and are easily
picked up early in the breadth-first sampling. This property
would reduce the degree distribution exponent and
produce a heavier tail but preserve the power-law nature
of the network [9].

4. ANALYSIS 

We begin by analyzing the structure of the Myspace artist
network sample - showing that it conforms, in most respects,
to the topology expected from such a social network.
We then develop a simple metric for exploring the
interaction between signal-based music similarity and the
network structure.

4.1. Network Analysis 

The Myspace artist network sample exhibits many of the 
network characteristics common to social networks and 
other real-world networks. Some of the network statistics 
are summarized in Table 1. 

Although the network is constructed as a directed network,
for our purposes we convert to an undirected network
to simplify analysis. Each edge is considered bi-directional,
that is $(i, j) = (j, i)$, and if a reciprocal pair of
edges exists in the directed graph, only one bi-directional
edge exists in the undirected graph. An examination of the
directed graph is reserved for later work.
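The conversion described above amounts to collapsing each ordered pair into an unordered one. The following minimal sketch shows this, together with the average-degree arithmetic $\langle k \rangle = 2m/n$; the function names are assumptions for illustration, and dropping self-loops is an assumption of this sketch rather than something stated in the text.

```python
def to_undirected(directed_edges):
    """Collapse directed (i, j) pairs into a set of unordered edges.

    A reciprocal pair (i, j), (j, i) yields a single undirected edge.
    Self-loops are dropped (an assumption of this sketch).
    """
    return {frozenset((i, j)) for i, j in directed_edges if i != j}


def average_degree(n_nodes, n_edges):
    """Average degree when every edge contributes to two endpoints."""
    return 2 * n_edges / n_nodes


# Toy example (hypothetical): a <-> b is reciprocal, a -> c is not.
undirected = to_undirected([("a", "b"), ("b", "a"), ("a", "c")])

# Check against the node and edge counts reported in Table 1.
k_undirected = round(average_degree(15478, 91326), 3)
k_directed = round(average_degree(15478, 120487), 3)
```

Applied to the counts in Table 1, the same arithmetic reproduces both reported average degrees, which suggests that the directed figure counts each edge at both of its endpoints (indegree plus outdegree).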

The degree distribution for the undirected network is
plotted in Figure 1 on a log-log scale. As mentioned earlier,
it is common to find a power-law degree distribution
in social networks [12]. However, exponential degree
distributions have been reported previously in some types of
music recommendation networks [4]. This is especially
true for networks with imposed degree limits. For moderate
degree values ($35 < k < 200$), our sample shows
a power-law distribution. For lower degree values, the
distribution is closer to exponential. This may be related
to the fact that our network has an outdegree limit imposed
by Myspace restricting the maximum number of top
friends ($k^{out} \leq 40$). The power-law fit also breaks down
for high values of $k$ - most likely due to the limited scope
of our sample. Similar “broad-scale” degree distributions
have been reported for citation networks and movie actor
networks [2].



Figure 1. The cumulative degree distributions for the
Myspace artist network sample. For moderate values of
$k$, the distribution follows a power-law (indicated by the
dotted line), but for low and high values the decay is
exponential.

4.2. Signal-based analysis 

MFCCs are extracted from each audio signal using a Hamming
window on 8192-sample FFT windows with a 4096-sample
overlap. These FFT windows are gathered into
100 ms non-overlapping frames. All MFCCs are created
with the fftExtract tool 5. For each artist node a GMM is
built from the concatenation of MFCC frames for all songs
found on each artist’s Myspace page (generally between 1
and 4 songs although some artists have more). An $n \times n$
matrix is populated with the earth mover’s distance $A_{ij}$
between the GMMs corresponding to each pair of nodes
in the sample.

4.3. Relationship with signal-based measures 

We explore a simple relation between audio signal dissimilarity
and network structure using a box and whisker plot.
The plot is shown in Figure 2. For all pairs of artists $i$ and
$j$, the EMD dissimilarity ($A_{ij}$) is found. These dissimilarities
are grouped according to the geodesic distance in the
undirected network between the artist nodes $i$ and $j$, $d_{ij}$.
There appears to be no clear correlation between these
$A_{ij}$ values and geodesic distance. The Pearson product-moment
correlation coefficient confirms this, giving a $\rho$ of
$-0.0016$, with a p-value of $1.50 \times 10^{-20}$. This should be
viewed in the context of the number of pairwise relationships
used, implying it is stable, at least for the community
of artists found via this sample of the network.
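For reference, the Pearson product-moment coefficient used above can be computed as in the sketch below. The input lists are invented toy values, not data from this study, and the function computes only the coefficient, not the associated p-value.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy values (hypothetical): geodesic distances paired with dissimilarities.
rising = pearson([1, 2, 3], [2.0, 4.0, 6.0])   # dissimilarity grows with distance
falling = pearson([1, 2, 3], [3.0, 2.0, 1.0])  # dissimilarity shrinks with distance
flat = pearson([1, 2, 3], [1.0, 0.0, 1.0])     # no linear trend
```

A coefficient near zero, as in the third toy case, indicates the absence of a linear relationship only; it does not rule out other forms of dependence between the two quantities.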

5. DISCUSSION AND FUTURE WORK 

Whatever slight trend may seem evident in Figure 2, it
is clear from the Pearson $\rho$ shown in Section 4.3 that no
correlation exists in this set. Clearly then an attempt to use
geodesic distance to predict acoustic similarity would not
succeed. While artists that are friends may sound similar,
the assertion of friendship cannot be taken to imply an
acoustic similarity.

5 source code at http://omras2.doc.gold.ac.uk/software/fftextract/

Figure 2. The box and whisker plot showing the spread of
pair-wise artist dissimilarity grouped by geodesic distance
as found on the artist graph.

The lack of a correlation seen between the artist network
and the EMD dissimilarities certainly warrants a
more exhaustive exploration of how content-based
retrieval systems can interoperate in the ever-growing space
of social networks. The relationship seen in this sample is
clear enough, but does it extend into the entirety of the
Myspace network? More generally, does this trend occur
uniformly across different communities of artists or
are there significant differences between communities of
artists as to the sonic similarity of friends?